Jan 27, 2003 1:05am
proposal for CD-ROM archive - comments?
Hey there all,
Following my previous messages on this board (I posted as h0l211), I've been up to the Internet Archive in person to talk about helping out with the CD-ROM archive.
I then wrote up this informal proposal, which Brewster and others suggested that I post here for comments.
If anyone has any feedback about what you'd like to see, and _especially_ technical issues (backup formats and so on), please reply, or email me personally at h0l @ mono211.com if you'd prefer.
This proposal deals with the best way to archive the CD-ROMs in the Internet Archive's Macromedia collection. The collection comprises many thousands of CD-ROMs of PC, Mac, and PC/Mac format, mainly made between the years of 1994 and 2000.
Although a number of people online are (unofficially) archiving console software and game ROMs, nobody is making sure there are perfect digital copies and databases of the PC/Mac CD 'multimedia' boom and bust of the early and mid 90s. This is a _vital_ pre-broadband era where some of the first widely available ideas of 'virtual reality' and cinema-quality 3D graphics for the home were being explored (see 'Myst'!).
Although the Internet has now superseded a lot of the multimedia ideals the Macromedia collection stands for, that's precisely WHY the collection is important - as a document of what the era stood for. As an added incentive, the collection is stored on decayable CD media, and it's not clear how long it will be until these discs lose their reflective surfaces and become unplayable (some people claim 10 to 25 years!).
Making copies of the discs and their artwork now and storing them in a searchable database will help current and future historians of the era, and making the most interesting and relevant material available for download (with the full permission of the copyright holders!) will make people who love abandonware and free software VERY happy.
1. CD-ROM ARCHIVE FORMAT
The first important decision is how best to archive the discs as an exact copy, and then how best to distribute them to the public and use them in other ways.
The official FAQ for the newsgroup alt.binaries.cd.image recommends using an .ISO format for a CD that has one data-only track, and a .bin/.cue format for a Mode 2/Mixed Mode CD - i.e., one that has a data track and multiple audio tracks. Another possibility is that the program is simple enough that the files could be extracted directly to hard disc from, say, a .ZIP file, and they would still run.
So this leaves us with 3 possibilities:
.BIN/.CUE - a 'perfect' digital copy of the disc. Needs to be burnt to disc before it will work, however.
.ISO or .ISO/.WAV - a copy of the disc that should be perfect as long as there isn't any exotic copy protection or multiple audio tracks on the disc. You can handle audio tracks as WAVs alongside ISOs, but re-burning them might be confusing.
.ZIP - a zipped-up version of the files contained on the disc.
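To make the .BIN/.CUE option a bit more concrete: the .cue half is just a small text file describing where each track starts inside the .bin. Here is a rough Python sketch that builds one for a mixed-mode disc. The file name, track layout and sector offsets are all made up for illustration, and it leaves out details a real rip would need (pregaps/INDEX 00 entries, for instance).

```python
FRAMES_PER_SECOND = 75  # CD sectors ("frames") per second of playing time


def sectors_to_msf(sectors):
    """Convert a sector offset into the MM:SS:FF form used in cue sheets."""
    minutes, remainder = divmod(sectors, 60 * FRAMES_PER_SECOND)
    seconds, frames = divmod(remainder, FRAMES_PER_SECOND)
    return f"{minutes:02d}:{seconds:02d}:{frames:02d}"


def build_cue_sheet(bin_name, tracks):
    """Build cue sheet text for a single .bin file.

    tracks is a list of (track_type, start_sector) tuples in disc order,
    where track_type is a cue sheet type such as "MODE1/2352" or "AUDIO".
    """
    lines = [f'FILE "{bin_name}" BINARY']
    for number, (track_type, start_sector) in enumerate(tracks, start=1):
        lines.append(f"  TRACK {number:02d} {track_type}")
        lines.append(f"    INDEX 01 {sectors_to_msf(start_sector)}")
    return "\n".join(lines) + "\n"


# Hypothetical mixed-mode disc: one data track followed by two audio tracks.
example = build_cue_sheet(
    "totem_poles.bin",
    [("MODE1/2352", 0), ("AUDIO", 45000), ("AUDIO", 58500)],
)
print(example)
```

The point is only that the .cue records the track types and boundaries that a plain .ISO has no place for.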
These formats all have their advantages and disadvantages. I personally think we should discount .ZIP as a format because:
1. It's fairly easy to run ISOs as virtual CD-ROM drives on the PC - there's a simple setup for it. This will mean that we're really providing the CD-ROM 'as is' if we provide an ISO - it's a fairly pure version of the original disc which may also pass security checks to see if the CD-ROM is present.
2. It's also possible to extract files from ISOs easily with the Isobuster utility on PC. So if people don't like having virtual CD-ROM drives, they can just extract the files that way.
3. I wouldn't think .ZIP deals with dual PC/Mac format discs well at all, whereas .BIN/.CUE _should_, and .ISO _might_ - hah!
So .ISO is a good format, but I'm not sure it deals so well with multiple audio tracks. So my temptation right now would be:
- .BIN/.CUE for the 'master' copy of everything.
- .ISO for any CD that only has one data track.
- _maybe_ .ISO and .WAV for CDs with extra audio tracks - we need to research how easy it is to emulate and re-burn these.
We are, unfortunately, creating twice the data this way, though.
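The rule of thumb above could be sketched in Python roughly like this. It's only an illustration of the decision, assuming someone has already worked out whether each track on a disc is data or audio; the function name and the "data"/"audio" labels are my own invention.

```python
def choose_formats(track_types):
    """Return the image formats to produce for one disc.

    track_types lists each track on the disc as "data" or "audio".
    """
    formats = ["BIN/CUE"]  # master copy of everything
    data_tracks = track_types.count("data")
    audio_tracks = track_types.count("audio")
    if data_tracks == 1 and audio_tracks == 0:
        # Single data track: a plain ISO is enough for distribution.
        formats.append("ISO")
    elif data_tracks == 1 and audio_tracks > 0:
        # Mixed mode: ISO plus WAVs is only a maybe until it has been tested.
        formats.append("ISO+WAV (needs research)")
    return formats


print(choose_formats(["data"]))
print(choose_formats(["data", "audio", "audio"]))
```

Anything that doesn't fit either case (no data track, or more than one) just keeps the BIN/CUE master.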
There are some Mac issues that need working through, but Macs can burn .ISO without any trouble, and Toast for Mac can burn .BIN/.CUE. Need to make sure backing up an .ISO from a PC won't negate the Mac-compatible bits of the disc, mind you - some testing needed.
[Multiple audio tracks are definitely an issue with a minority of the Macromedia collection, by the way, because CD-quality audio was one of the main draws of multimedia at that time, so many applications played music from the CD drive whilst the program was running.]
2. CD-ROM ARTWORK FORMAT
Eventually, the manuals deserve to be scanned in full for posterity, time and funds permitting. Since we have less of both for now, scanning the front and back covers of the CD-ROM and making all of them available online (whether the file image is available for download or not) would do a LOT to enhance the visual nature and attractiveness of the collection, especially for those titles that can't be downloaded.
So the suggestion for artwork for now is the front and back covers of the CD packaging _OR_ CD case only at the following sizes:
- master offline image - .TIFF at very high scan quality, not intended to be posted on the website.
- master online image - .JPG at size which enables you to read all text. You'll get this when you click on the thumbnails on the website.
- thumbnail online image - .JPG at small size, as with current thumbnails showing on site.
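For the two online sizes, the only real calculation is scaling the master scan down without distorting it. A small Python sketch of that arithmetic follows. The pixel limits (1600 and 150) and the scan size are placeholders I've picked for illustration, not agreed numbers, and the actual resizing would be done by whatever image tool gets used.

```python
def fit_within(width, height, max_side):
    """Scale (width, height) so the longer side is at most max_side.

    Keeps the aspect ratio and never enlarges a small image.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    if width >= height:
        return max_side, max(1, round(height * max_side / width))
    return max(1, round(width * max_side / height)), max_side


# Hypothetical master scan of a CD case front at high resolution.
master = (2850, 2800)
online = fit_within(*master, max_side=1600)    # assumed "readable text" size
thumbnail = fit_within(*master, max_side=150)  # assumed thumbnail size
print(online, thumbnail)
```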
3. 'MACROMEDIA COLLECTION' CD-ROM ARCHIVE CONTENTS
It's important to recognise that ALL of the CD-ROMs in the collection are important. But equally, with such a large number of CDs to sort through, I think the collection should be prioritised into three different areas.
1. PRIORITY - these CD-ROMs should be dealt with first, because they offer information that's not available elsewhere (a museum CD-ROM about totem poles, for example), they're good examples of multimedia from the time (an educational adventure about dinosaurs), or they're good pieces of cultural ephemera (the Betty Ford Clinic promotional CD-ROM or the 'magazine on a disc' ventures.)
2. NON-PRIORITY - these CD-ROMs are still important and should be dealt with when time and funds permit, but they either contain information that is NOT media rich (simple training programs which would be shown on webpages nowadays) or don't have the CD-ROM as their main focus (a music album with a small amount of added multimedia).
3. JAPANESE-LANGUAGE - I suspect these discs should be separated out, because we need to look at compatibility issues with backing up (can you backup Japanese-language discs if you don't have J-Win installed?) and playing issues (do you need J-Win to run these discs?) If we can work out compatibility problems, we can then prioritise them into one of the two categories above.
4. 'INTERNET ARCHIVE COLLECTION' CD-ROM ARCHIVE CONTENTS
There should probably be a new collection, which will at first be VERY small, which could be called the 'Internet Archive Collection', since that's who will be assembling it. The point of this is - when we come out and (re)launch the site, there needs to be at least SOME multimedia CD-ROM stuff on there to download that people will get excited about. Some of this may be cherry-picked from places other than the Macromedia archive. Right now I'm particularly thinking of:
1. Voyager Company titles - this was the CD-ROM part of the well-known Criterion Collection laserdiscs and DVDs. We should definitely find out about whether this would be possible.
2. Cyan titles - the earlier pre-Myst titles from Cyan like 'Cosmic Osmo' and 'Manhole' are resoundingly out of print. Got to be worth a try.
3. 'Total Distortion' from Joe Sparks and Pop Rocket – a classic proto-multimedia release from the guy who has now gone on to create Devildoll and Radiskull for Shockwave :)
4. 'Starship Titanic' interests me a lot, but I have no idea whether that would be a possibility. It was Douglas Adams' last CD-ROM project and is now out of print.
The rights issues for some of these are definitely problematic, though - please be aware that this is a wishlist and there's no guarantee ANY of the above will ever appear on the site :)
5. MAKING CD-ROMS REMOTELY ACCESSIBLE
I know this was one of the original goals of the project, and I've been looking a little at the technical issues. The problem definitely seems to be that most of these multimedia CD-ROMs play audio and video files, and I just don't see a possibility of them streaming properly over a normal broadband network with a VNC-like 'PCAnywhere' piece of software running. Looking at messageboards, people are having significant trouble just over their LAN. Simple Director-authored things with easy animations and links might work ok, but that's not necessarily where the meat of the interest in the collection lies, imho.
But it's CERTAINLY worth doing LAN tests to see if things will behave properly, with a view to making machines remotely accessible over either broadband (slowly) or Internet2 (quicker!) if other issues with security and suchlike can be resolved. If this could work, it would rock :)
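To put some very rough numbers on the streaming worry, here is a back-of-envelope Python sketch. The CD audio rate is the standard one (44.1 kHz, 16-bit, stereo). Everything else is an assumption for illustration only: the 640x480, 256-colour, 15 frames/sec screen and the 1.5 Mbit/s link are guesses, and real remote-desktop software compresses its updates, so treat the video figure as a worst case rather than a prediction.

```python
def cd_audio_bits_per_second(sample_rate=44_100, bits_per_sample=16, channels=2):
    """Uncompressed CD-quality audio bitrate."""
    return sample_rate * bits_per_sample * channels


def raw_screen_bits_per_second(width, height, bits_per_pixel, frames_per_second):
    """Bitrate of sending every pixel of every frame with no compression."""
    return width * height * bits_per_pixel * frames_per_second


ASSUMED_LINK_BPS = 1_500_000  # assumption: roughly 1.5 Mbit/s home broadband

audio = cd_audio_bits_per_second()
video = raw_screen_bits_per_second(640, 480, 8, 15)  # assumed screen settings

print(f"CD audio alone: {audio / ASSUMED_LINK_BPS:.0%} of the assumed link")
print(f"Raw screen updates: {video / ASSUMED_LINK_BPS:.1f}x the assumed link")
```

Under those assumptions the uncompressed audio on its own nearly fills the link before any screen updates are sent.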
This post was modified by simon c on 2003-01-27 09:05:43
Jan 28, 2003 6:39am
Re: proposal for CD-ROM archive - comments?
i'd just like to point out something about the bin/cue format... while it's a fairly useful format it's not without its problems.
while bin/cue images can consist of cooked sectors, the format is most versatile/useful when used as a collection of raw-mode sector reads (otherwise an iso could easily suffice).
since i can't explain it as well as some others, but have experienced _cdrom_generational_loss_ firsthand - stemming from raw-mode reads, here's a piece from the comp.periph.cdr faq from the section "can i make copies of copies?"
The heart of the problem is the way that the data is read from the source device. When a program does "raw" sector reads, it gets the entire 2352-byte block, which includes the CD-ROM error correction data (ECC) for the sector. Instead of applying the ECC to the sector data, the drive just hands back the entire block, including any errors that couldn't be corrected by the first C1/C2 layer of error correction (see section (2-17)). When the block is written to the CD-R, the uncorrected errors are written along with it. This problem can be avoided by using "cooked" reads and writes. Rather than create an exact duplicate of the 2352-byte source sector, cooked reads pull off the error-corrected 2048-byte sector. The CD recorder regenerates the appropriate error correction when the data is written. Ideally SNAPSHOT [or other software - ed.] would be able to do the error correction in software when operating in "raw" mode, but apparently there's no readily available code that does this. It could also read each block twice, once in raw mode and once in cooked, but that would double the read time.
This begs the question, why not just use cooked writes all the time? First of all, some recorders (e.g. Philips CDD2000 and HP4020i) don't support cooked writes. (Some others will do cooked but can't do raw, e.g. the Pinnacle RCD-5040.) Second, not all discs use 2048-byte MODE-1 sectors. There is no true "cooked" mode for MODE-2 data tracks; even a block length of 2336 is considered raw, so using cooked reads won't prevent generation loss. It is important to emphasize that the error correction included in the data sector is a *second* layer of protection. A clean original disc may well have no uncorrectable errors, and will yield an exact duplicate even when copying in "raw" mode. After a few generations, though, the duplicates are likely to suffer some generation loss. The original version of this quote went on to comment that Plextor and Sony CD-ROM drives were not recommended for making copies of copies. The reason they were singled out is because they are the only drives that explicitly warned about this problem in their programming manuals.
It is possible that *all* CD-ROM drives behave the same way. (In fact, it is arguably the correct behavior... you want raw data, you get raw data.)
The final answer to this question is, you can safely make copies of copies, so long as the disc is a MODE-1 CD-ROM and you're using "cooked" writes. Copies made with "raw" writes may suffer generation loss because of uncorrected errors. Audio tracks don't have the second layer of ECC, and will be susceptible to the same generation loss as data discs duplicated in "raw" mode. Some drives may turn off some error-correcting features, such as dropped-sample interpolation, during digital audio extraction, or may only use them when extracting at 1x. If you want to find out what your drive is capable of, try extracting the same track from a CD several times at different speeds, then do a binary comparison on the results.
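the "extract several times and compare" test at the end of that quote is easy to script. here's a rough python sketch that uses sha-256 hashes as a stand-in for a byte-by-byte comparison (matching hashes means matching files for all practical purposes). the function names are mine, and it assumes the extractions have already been saved to disk by whatever ripping tool is in use.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extractions_match(paths):
    """True if every file in paths has identical contents."""
    digests = {file_digest(path) for path in paths}
    return len(digests) <= 1
```

usage would be something like extractions_match(["track02_1x.wav", "track02_8x.wav"]), with the file names being whatever your rips are called.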
whether or not the underlying technical details are accurate for today's hardware and software, i have experienced cd generational loss firsthand, and i can vouch for the fact that using raw-mode reads can truly screw your copies over.
based on the above and on my own personal experience (been doing copies/extractions/burns for 9+ yrs) i would not recommend the bin/cue format unless great care was taken to ensure that cooked sectors were used where possible. not all software does this for you. extraction followed by testing of copies might be the only way to determine what mode (cooked/raw) will work.
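for anyone wondering what the raw/cooked difference looks like at the byte level, here's a small python sketch that pulls the 2048-byte user data out of raw 2352-byte MODE-1 sectors (12 bytes sync, 4 bytes header, 2048 bytes data, then 288 bytes of EDC/ECC). big caveat: this only strips the extra bytes. it does NOT check or apply the error correction, so it won't fix the uncorrected errors the faq is talking about, and it's no substitute for a proper cooked read from the drive. the function name is made up.

```python
RAW_SECTOR_SIZE = 2352
SYNC_AND_HEADER = 16  # 12-byte sync pattern + 4-byte header
USER_DATA_SIZE = 2048
SYNC_PATTERN = b"\x00" + b"\xff" * 10 + b"\x00"


def cooked_from_raw_mode1(raw_image):
    """Return the 2048-byte user data of each raw MODE-1 sector, concatenated.

    No EDC/ECC checking is done; damaged data is copied through unchanged.
    """
    if len(raw_image) % RAW_SECTOR_SIZE != 0:
        raise ValueError("image length is not a multiple of 2352 bytes")
    cooked = bytearray()
    for offset in range(0, len(raw_image), RAW_SECTOR_SIZE):
        sector = raw_image[offset:offset + RAW_SECTOR_SIZE]
        if sector[:12] != SYNC_PATTERN:
            raise ValueError(f"no sync pattern at offset {offset}")
        if sector[15] != 1:  # last header byte holds the sector mode
            raise ValueError(f"sector at offset {offset} is not MODE-1")
        cooked += sector[SYNC_AND_HEADER:SYNC_AND_HEADER + USER_DATA_SIZE]
    return bytes(cooked)
```

the mode check matters because, as the faq says, MODE-2 tracks have no true cooked form, so blindly slicing 2048 bytes out of those would give you garbage.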
festering leper (never been 'burned' by cooked mode track reads :) )